{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <font color='green'> Regression (Sparse)</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset - E2006-tfidf\n",
    "\n",
    "### <font color='green'> 1. Description</font>\n",
    "Financial 10-K reports from thousands of publicly traded U.S. companies, published between 1996 and 2006, paired with stock return volatility measurements for the twelve-month periods before and after each report.\n",
    "\n",
    "The target variable (y) is the (log) stock return volatility in the twelve months after a report, which serves as a measure of the risk of buying that stock. As the dataset name indicates, the features (X) are mainly TF-IDF weights of the words in each report, which is why the feature matrix is sparse and very high-dimensional.\n",
    "\n",
    "Download links: <br>\n",
    "Train Data: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.train.bz2 <br>\n",
    "Test Data: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/E2006.test.bz2\n",
    "\n",
    "\n",
    "Data sources:\n",
    "1. http://www.cs.cmu.edu/~ark/10K/\n",
    "2. http://www.cs.cmu.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf\n",
    "3. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/"
   ]
  },
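  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The download links above point to bz2-compressed files, while the loading code in this notebook reads the uncompressed `datasets/E2006.train` and `datasets/E2006.test`. A minimal sketch of the decompression step, using only the Python standard library, could look like the following. The helper name `decompress_bz2` and the assumption that the archives were saved under `datasets/` are illustrative and not part of the original notebook.\n",
    "\n",
    "```python\n",
    "import bz2\n",
    "import shutil\n",
    "\n",
    "def decompress_bz2(src_path, dst_path):\n",
    "    '''Stream-decompress a .bz2 file to dst_path (hypothetical helper).'''\n",
    "    with bz2.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:\n",
    "        shutil.copyfileobj(src, dst)\n",
    "    return dst_path\n",
    "\n",
    "# Assumed locations of the downloaded archives:\n",
    "# decompress_bz2('datasets/E2006.train.bz2', 'datasets/E2006.train')\n",
    "# decompress_bz2('datasets/E2006.test.bz2', 'datasets/E2006.test')\n",
    "```\n",
    "\n",
    "Streaming with `shutil.copyfileobj` avoids holding the whole decompressed file in memory at once."
   ]
  },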
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 2. Data Preprocessing </font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import time\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from collections import OrderedDict\n",
    "from sklearn import datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def preprocess_data(filename_train, filename_test, feature_dim):\n",
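    "    '''\n",
    "    Load the E2006-tfidf train and test files (svmlight/libsvm format) as\n",
    "    sparse float32 feature matrices together with their target vectors.\n",
    "    Passing n_features keeps the column count identical for both sets.\n",
    "    '''\n",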
    "    x_train, y_train = datasets.load_svmlight_file(\n",
    "        filename_train,\n",
    "        n_features=feature_dim,\n",
    "        dtype=np.float32)\n",
    "    x_test, y_test = datasets.load_svmlight_file(\n",
    "        filename_test,\n",
    "        n_features=feature_dim,\n",
    "        dtype=np.float32)\n",
    "    return x_train, y_train, x_test, y_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shape of train data: (16087, 150360)\n",
      "shape of test data: (3308, 150360)\n"
     ]
    }
   ],
   "source": [
    "#---- Data Preparation ----\n",
    "FILENAME_TRAIN = \"datasets/E2006.train\"\n",
    "FILENAME_TEST = \"datasets/E2006.test\"\n",
    "FEATURE_DIM = 150360\n",
    "x_train, y_train, x_test, y_test = preprocess_data(FILENAME_TRAIN, FILENAME_TEST, FEATURE_DIM)\n",
    "print(\"shape of train data: {}\".format(x_train.shape))\n",
    "print(\"shape of test data: {}\".format(x_test.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 3. Algorithm Evaluation </font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_time = []\n",
    "test_time = []\n",
    "train_score = []\n",
    "test_score = []\n",
    "estimator_name = []"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate(estimator, estimator_nm,\n",
    "             x_train, y_train,\n",
    "             x_test, y_test):\n",
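    "    '''\n",
    "    Fit a frovedis or sklearn estimator and append its timings and scores\n",
    "    to the module-level report lists. Note that the recorded test time\n",
    "    covers scoring on both the train and the test set.\n",
    "    '''\n",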
    "    estimator_name.append(estimator_nm)\n",
    "\n",
    "    start_time = time.time()\n",
    "    estimator.fit(x_train, y_train)\n",
    "    train_time.append(round(time.time() - start_time, 4))\n",
    "\n",
    "    start_time = time.time()\n",
    "    train_score.append(estimator.score(x_train, y_train))\n",
    "    test_score.append(estimator.score(x_test, y_test))\n",
    "    test_time.append(round(time.time() - start_time, 4))"
   ]
  },
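  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For scikit-learn regressors such as `Lasso`, `score` returns the coefficient of determination $R^2$, so the train and test scores recorded by `evaluate` are $R^2$ values. The frovedis estimator is assumed here to follow the same convention. As a rough illustration, $R^2$ can be computed by hand as below; the function name `r2_by_hand` is made up for this sketch.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def r2_by_hand(y_true, y_pred):\n",
    "    '''R^2 = 1 - SS_res / SS_tot (illustrative helper).'''\n",
    "    y_true = np.asarray(y_true, dtype=float)\n",
    "    y_pred = np.asarray(y_pred, dtype=float)\n",
    "    ss_res = np.sum((y_true - y_pred) ** 2)\n",
    "    ss_tot = np.sum((y_true - y_true.mean()) ** 2)\n",
    "    return 1.0 - ss_res / ss_tot\n",
    "\n",
    "# A perfect prediction gives 1.0; always predicting the mean gives 0.0.\n",
    "print(r2_by_hand([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))\n",
    "print(r2_by_hand([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))\n",
    "```\n",
    "\n",
    "A score below zero is possible and means the model does worse than predicting the mean of the targets."
   ]
  },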
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1 LassoRegressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
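    "TARGET = \"lasso_regressor\"\n",
    "\n",
    "# ---- Frovedis: start a server with 8 MPI processes ----\n",
    "# The FROVEDIS_SERVER environment variable must be set; its value is appended to the mpirun command\n",
    "import frovedis\n",
    "from frovedis.exrpc.server import FrovedisServer\n",
    "FrovedisServer.initialize(\"mpirun -np 8 \" + os.environ[\"FROVEDIS_SERVER\"])\n",
    "from frovedis.mllib.linear_model import Lasso as fLR\n",
    "f_est = fLR(lr_rate=4.89E-05)\n",
    "E_NM = TARGET + \"_frovedis_\" + frovedis.__version__\n",
    "evaluate(f_est, E_NM, x_train, y_train, x_test, y_test)\n",
    "# Release the fitted model, then shut the server down\n",
    "f_est.release()\n",
    "FrovedisServer.shut_down()\n",
    "\n",
    "# ---- scikit-learn ----\n",
    "# Note: the two estimators are run with different hyperparameter settings\n",
    "import sklearn\n",
    "from sklearn.linear_model import Lasso as sLR\n",
    "s_est = sLR(alpha=0.01)\n",
    "E_NM = TARGET + \"_sklearn_\" + sklearn.__version__\n",
    "evaluate(s_est, E_NM, x_train, y_train, x_test, y_test)"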
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <font color='green'> 4. Performance Summary </font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                         estimator  train time  test time  train-score  \\\n",
      "0  lasso_regressor_frovedis_0.9.10      3.4577     0.1509     0.528365   \n",
      "1   lasso_regressor_sklearn_0.24.1      7.2052     0.0583     0.653095   \n",
      "\n",
      "   test-score  \n",
      "0    0.315151  \n",
      "1    0.516401  \n"
     ]
    }
   ],
   "source": [
    "summary = pd.DataFrame(OrderedDict({ \"estimator\": estimator_name,\n",
    "                                     \"train time\": train_time,\n",
    "                                     \"test time\": test_time,\n",
    "                                     \"train-score\": train_score,\n",
    "                                     \"test-score\": test_score\n",
    "                                  }))\n",
    "print(summary)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
